1 ) Introduction

The following Tutorial is the final assessment of the project seminar “Species Distribution Modeling” at Philipps-University Marburg. In this tutorial we’re going to use the XGBoost algorithm to predict the specie´s distribution of butterflies in Pakistan and create a species richness map of the country. XGBoost (eXtreme Gradient Boosting) is a popular machine learning algorithm that belongs to the family of gradient boosting methods. It was developed by Tianqi Chen. and uses a combination of gradient boosting, decision trees, regularization, gradient-based optimization, feature importance analysis, parallelization. All this make’s it a robust and powerful algorithm that often delivers state-of-the-art results in various machine learning tasks. You will be introduced to the basic concepts of XGBoost and you’ll be provided with a reproducible workflow to use XGBoost to build classification models.

2 ) But what’s XGBoost?

XGBoost is a ensemble Method same as Random Forrest, this means it combines the output of multiple Trees. But the methods differ in the way the idividual Trees are build and how the results are combined. In Xgboost the Output oft the Trees aren’t combined equally. Instead XGBoost uses a method called boosting. Boosting combines weak learner (small trees) sequentually so that the new tree corrects the errors oft he previous one. To understand this we have to look into some mathematical details. But dont worry when using XGBoost these details will be automated. Nevertheless its importand to understand these processes to optimize the algorithm later on.

3 ) So how does it exactly work?

As said XGBoost builds on many concepts to deliver it’s often outstanding results. We’re going to start with the mathematical base concepts of how XGBoost builds trees.

In this assessment were trying to classify geo-points if they’re potential habitats for a number of butterfly species. In order to do that we need XGBoost to build classification trees. XGBoost work’s also with regression but the process differs a little, so were not going to focus on that.


When building classification trees XGBoost evaluates each given point if a condition is met or not. In our case the green dots, with a value of one, represent the presence points. Red dots, with values of zero represent absence points. The black line in the middle is XGBoost’s initial prediction. By default, it is 0.5, which means there is a 50% chance to find a butterfly at any given point.
In the next step XGBoost calculates the residuals of all given points. The residuals are the difference between the observed prediction and the predictet. If the observed value is one (i.e., a presence point), the residuals are 0.5. The same applies to values of zero (i.e., absence points) where the residuals are -0.5.
In order to find the threshhold of parameters that influence the probability of a point being a presence or absence point the data has to be split at the crucial values of that parameter.
Therefore XGBoost splits the observations at thresholds that result in the highest gain value. But whats the gain-value?
To calculate the gain-value we need the similarity-score of each leaf and the root.
By summing the similarity-scores of the left and the right leaf and then substracting the similarity-score of the root we get the gain-value. In this case it’s 3.8 and the highest possible for this dataset therefore this would be the final tree. But why doesn’t XGBoost split the residuals any further? This has to do with regularization parameters and pruning wich we’re going to explain in the next chapter.

3.1 ) Regularization & pruning

How XGBoost builds trees is limited by multiple regularization parameters:

3.1.1 ) Lambda

We’ve heard of Lambda when we’re calculated the similarity-score. XGBoost default value for Lambda is 0 therefore we’ve been ignoring it. But when Lambda is set to >0 the similarity-score get’s smaller because the denominator becomes larger. Thus Lambda prevents over-fitting.

3.1.2 ) Cover/min_child_weigth

Another regularization parameter is the Cover or min_child_weight. This parameter is also the reason why we haven’t continued building our example tree. In XGBoost the default value for the cover is 1 wich means that every leaf with a cover value less than 1 doesn’t get build. The cover for classification tree’s is calculated by summing the previous probability times 1 minus the previous probability, for each residual in the leaf.

3.2 ) Pruning

Similar to Cover (min_child_weigth) Gamma is a regularization parameter that causes XGBoost to only build new leafs when the Gain-Value is larger than Gamma.

Therefore it prevents overfitting the trees to our data. Gamma is a highly specialized regularization parameter, what mean’s that there is no “good” value. By default it’s 0 therefore no regularization takes place.

XGBoost substract’s gamma from the Gain-Value and then removes the leaf if the result is a negative number. For example if we take the previous calculated Gain-Value of our example tree of 3.8 a gamma-value of 4 would prune the whole tree down to the root. But if the gamma-value is just 0 XGBoost can build extremly large trees thus overfitting the trees to the dataset and raising the computation time a lot.

4 ) Application of XGBoost in R

XGBoost is available for different programming and scripting languages, include Python, R, Java and C++. Docuementation is available here: https://xgboost.readthedocs.io/en/stable/

4.1 ) Prerequisites

Before you dive into the code you need to install some packages this script will use:

install.packages("terra")
install.packages("ggplot2")
install.packages("fastDummies")
install.packages("tidyterra")

The XGBoost package can by installed in two different ways: * First there is the default package from CRAN, which will do it in most situations.

install.packages("xgboost")
  • But if you are dealing with large data sets you may want to use GPU acceleration. Therefor you have to use a prebuild package from GITHUB (https://github.com/dmlc/xgboost/releases). Download it, place it in the same folder as this script and run the commands below.

TODO: For Benchmarks of a High-End CPU vs Low-End GPU, see https://medium.com/data-design/xgboost-gpu-performance-on-low-end-gpu-vs-high-end-cpu-a7bc5fcd425b. Risks? Reproducability?

##
## !! Installation on windows failed with "Warning in system("sh ./configure.win") 'sh' not found", for dirty Solution see section Troubleshooting
##

# Install dependencies
install.packages(c("data.table", "jsonlite"))

# Install XGBoost
system(paste("R CMD INSTALL ", getwd(),  "/xgboost_r_gpu_win64_21d95f3d8f23873a76f8afaad0fee5fa3e00eafe.tar.gz", sep=""))

Don’t forget to enable CPU acceleration in the knitting paramters.

After installing all needed libaries you need to load them:

require(dplyr)      # easy dataframe modification
require(ggplot2)    # plotting

require(geodata)    # downloading geospatial world dataset made easy
require(sf)         # simple geospatial features
require(terra)      # raster manipulation

require(tidyterra)  # plot terra objects with ggplot

require(fastDummies)# create binary factor columns from character column
require(xgboost)    # our modeling libary

4.2 ) Data preparation

Let’s begin with preparing the data used to train the model. Start with getting a overview of the provided data:

species_occurrences_all <- read.table("data/PakistanLadakh.csv", sep=",", header=TRUE)
species_occurrences_all <- sf::st_as_sf(species_occurrences_all, coords=c("x", "y"), remove=TRUE, crs=sf::st_crs("epsg:4326"))

str(species_occurrences_all)
## Classes 'sf' and 'data.frame':   5870 obs. of  2 variables:
##  $ species : chr  "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" ...
##  $ geometry:sfc_POINT of length 5870; first list element:  'XY' num  73.1 34.4
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
##   ..- attr(*, "names")= chr "species"
ggplot() + 
  geom_sf(data = sf::st_as_sf(species_occurrences_all), mapping=aes(color=species), show.legend = FALSE) +
  ggtitle("Oberserved occurrence of butterflies in Pakistan")

Next we need some environmental data to train the model. Therefor we selected the bioclim data, which are widely used in speceis distribution modeling (Source: https://isprs-archives.copernicus.org/articles/XLII-4-W19/449/2019/isprs-archives-XLII-4-W19-449-2019.pdf#:~:text=The%20earliest%20studies%20of%20SDM%20used%20BIOCLIM%20-,requires%20species%20occurrence%20data%20%28latitude%2C%20longitude%2C%20and%20elevation%29.). The Bioclim layers are missing elevation data, we will use those too, since temperature is dependent on elevation. The Border of Pakistan is also needed, we will crop our data with that, so the model doesn’t train areas we don’t not have presence points. The elevation data is alread cropped so we don’t need to repeat this.

# political border of pakistan
border_pak <- geodata::gadm(country='PAK', level = 0, path='./data')
ggplot() +   
  geom_sf(data = sf::st_as_sf(border_pak), fill=NA)

# bioclim data from pakistan
bioclim_pak <- geodata::worldclim_country(country = "PAK", res = 10, var = "bio", path = "data/", version = "2.1")
names(bioclim_pak) <- substr(names(bioclim_pak), 11, 20) # TODO: rename to mare meaningful names or show table of layers in text
bioclim_pak <- terra::mask(bioclim_pak, border_pak)
ggplot() + 
  geom_spatraster(data = bioclim_pak) +
  facet_wrap(~lyr) +
  geom_sf(data = sf::st_as_sf(border_pak), fill = NA, show.legend = FALSE) + 
  ggtitle("Bioclim data of Pakistan")

# elevation data form pakistan
elevation_pak <- geodata::elevation_30s(country = 'PAK', path = 'data/')
ggplot() + 
  geom_spatraster(data = elevation_pak) +
  geom_sf(data = sf::st_as_sf(border_pak), fill = NA, show.legend = FALSE) +
  scale_fill_hypso_c(name = "Elevation")

So lets define our species_occurrences as presence points :

species_presence <- species_occurrences_all
rm(species_occurrences_all)

As absence points we will user random Points in Pakistan and combine them with the presence points. The absence points will be extended by a column “species”, which matches the column “species”´in the presence points.

# Generate random points inside pakistan as background points and extend them with a column for species = NA
# TODO: why 1000 points?? give a explanation for the decision
border_pak <- sf::st_as_sf(border_pak)
species_absence <- sf::st_sample(border_pak, size = 1000)

# TODO: add col for occurrence = 1 or occurrence = 0 here, dummycols can then be thrown away
# adding col species = NA to the background points, needed for rbind to join the data
species_absence <- cbind(species_absence, data.frame(species = as.character(NA)))
species_absence <- sf::st_as_sf(species_absence)

# Combine presence and absence (background) points into a single object
modeling_data_ <- rbind(species_presence, species_absence)

# Only points inside Pakistan should be used for modeling, also remove the columns added by the intersection. 
modeling_data_ <- sf::st_intersection(modeling_data_, border_pak) %>% select(-COUNTRY, -GID_0)

Now we got our presence and absence points as spatial data. Finally we will extract values from our environmental data and add those to our modeling data, so xgboost can use this table to train its model.

# Extract values from bioclim and elevation, join them to our modeling_data
extraction_bioclim_pak <- terra::extract(bioclim_pak, modeling_data_, bind=FALSE, ID=FALSE)
extraction_elevation_pak <- terra::extract(elevation_pak, modeling_data_, bind=FALSE, ID=FALSE)
modeling_data_extracted <- cbind( modeling_data_, extraction_bioclim_pak, extraction_elevation_pak)

Clean up of no longer needed variables and check the final modeling data:

# create a final data variable and clean up variables
modeling_data <- modeling_data_extracted

rm(species_occurrences_all); rm(species_presence); rm(species_absence); rm(modeling_data_); rm(extraction_bioclim_pak); rm(extraction_elevation_pak); rm(modeling_data_extracted)
## Warning in rm(species_occurrences_all): Objekt 'species_occurrences_all' nicht
## gefunden
str(modeling_data)

ggplot() + 
  geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
  geom_sf(data = sf::st_as_sf(modeling_data), mapping=aes(color=species), show.legend = FALSE) +
  ggtitle("Oberserved occurrence of butterflies in Pakistan plus background points")

## Classes 'sf' and 'data.frame':   1058 obs. of  22 variables:
##  $ species    : chr  "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Aglais_caschmirensis" ...
##  $ bio_1      : num  17.7 12.25 15.35 16.44 -3.51 ...
##  $ bio_2      : num  12.12 10.44 11.33 11.86 9.16 ...
##  $ bio_3      : num  37.9 36.5 37.7 36.8 27.8 ...
##  $ bio_4      : num  719 682 682 765 852 ...
##  $ bio_5      : num  33.8 26.1 30.1 32.1 13.3 ...
##  $ bio_6      : num  1.8 -2.5 0 -0.1 -19.6 ...
##  $ bio_7      : num  32 28.6 30.1 32.2 32.9 ...
##  $ bio_8      : num  24.7 19.2 22 11.2 -10.4 ...
##  $ bio_9      : num  13.98 9.37 12.07 12.6 -1.78 ...
##  $ bio_10     : num  26.02 20.15 23.17 25.47 6.92 ...
##  $ bio_11     : num  8.3 3.37 6.37 6.7 -13.65 ...
##  $ bio_12     : num  1358 1376 1005 699 707 ...
##  $ bio_13     : num  275 239 169 117 122 80 74 75 116 107 ...
##  $ bio_14     : num  28 26 22 16 18 14 13 11 20 18 ...
##  $ bio_15     : num  68.9 57.8 55.9 55.8 57.2 ...
##  $ bio_16     : num  628 585 413 276 319 213 197 196 308 280 ...
##  $ bio_17     : num  119 133 99 74 82 49 45 45 91 70 ...
##  $ bio_18     : num  619 572 406 210 107 63 53 57 110 116 ...
##  $ bio_19     : num  251 259 195 139 205 101 91 97 210 144 ...
##  $ PAK_elv_msk: num  1219 2315 1581 1628 4563 ...
##  $ geometry   :sfc_POINT of length 1058; first list element:  'XY' num  73.1 34.4
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:21] "species" "bio_1" "bio_2" "bio_3" ...
species_nsamples = data.frame(modeling_data) %>% 
                    count(species, sort=TRUE) %>% 
                    rename(n_samples = n) %>% 
                    filter(!is.na(species))

ggplot(species_nsamples, aes(n_samples)) +
       geom_histogram(binwidth = 5) +
       geom_vline(aes(xintercept=mean(n_samples)), linetype="dashed") +
       annotate(x=mean(species_nsamples$n_samples), y=+Inf, label=paste("Mean:",round(mean(species_nsamples$n_samples),2)), vjust=3, geom="label") +
       labs(x = "Number of Samples", y = "Number of Species")

rm(species_nsamples)

4.3 ) Training and prediction with xgboost

Starting with the training of our xgboost model we decided to do sperate training for every model. Despite that xgboost is capable of Multi-Classificiation. Therefor we defined a function ‘train’ wich invokes out data filtering and xgboost specific data preperation to meet the requirements of xgboost. Espacially converting the modeling data into a ‘xgb.DMatrix’ object. Finally we define some general parameters we want to use, e.g. enable of gpu acceleration by knit paramters. Last but not least we save the model to the disk to preserve it for modeling.

train <- function(
    data, # training data, required
    sp, # species name, required
    xgb_params = list("nrounds" = 1000), # xgboost params, see https://xgboost.readthedocs.io/en/latest/parameter.html
    save_model = TRUE 
) {
# filter modeling data to current species, don't forget the absence points!
data <- modeling_data %>% filter(species == sp | is.na(species))
data <- dummy_columns(data, select_columns = "species",
                      ignore_na = TRUE, 
                      remove_selected_columns = TRUE)
  
# make dummy_columns name persistent over all species
# replace NA values with 0
data <- data %>% 
  rename(species_occurrence = starts_with("species_")) %>%
  mutate(species_occurrence = ifelse(is.na(species_occurrence), 0, species_occurrence)) %>%
  mutate(species_occurrence = factor(species_occurrence))
  
# xgboost need a specific data format  
data <- xgb.DMatrix(
  data = as.matrix(data %>% select(-species_occurrence, -geometry)), 
  label = as.matrix(data %>% select(species_occurrence))  
)

# enable gpu acceleration only if wanted
if(params$gpu_acc) {
  xgb_params = c(xgb_params, predictor="gpu_predictor")
  xgb_params = c(xgb_params, tree_method="gpu_hist")
  xgb_params = c(xgb_params, sampling_method = "gradient_based")
}
#xgb_params = c(xgb_params, eval_metric="error") # Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
xgb_params = c(xgb_params, objective = "binary:logistic") # logistic regression for binary classification, output probability

model <- xgboost(data = data,
                 verbose = 0,            # 0 (silent), 1 (warning), 2 (info), 3 (debug)
                 nrounds = xgb_params$nrounds,
                 params = xgb_params
                 )

message(paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])))
  
if(save_model)
{
  xgb.save(model, paste("out/", sp, ".model", sep = ""))  # Max compatibility with future xgboost versions
  save(model, file = paste("out/", sp, ".rds", sep = "")) # Fully restorable r object
}
  
return(model)
}

Additional training paramters are defined here in a own data frame. Advantage of this are the comparability and useability of the different parameter sets. The Table show the default parameter, one set taken from Roozbeh Valavi et. al. And last the parameters we will user for training. Those have been tested and tuned manually.

rm(xgboost_params)
## Warning in rm(xgboost_params): Objekt 'xgboost_params' nicht gefunden
xgboost_params <- data.frame("dataset" = character(), 
                             "nrounds" = numeric(), 
                             "eta" = numeric(), 
                             "max_depth" = numeric(), 
                             "subsample" = numeric(), 
                             "gamma" = numeric(),
                             "alpha" = numeric(),
                             "lambda" = numeric(),
                             "colsample_bytree" = numeric(), 
                             "min_child_weight" = numeric()
                             )

xgboost_params <- xgboost_params %>% add_row(dataset = "default", eta = 0.3, max_depth = 6, gamma = 0, alpha = 0, lambda = 1, subsample = 1, colsample_bytree = 1, min_child_weight = 1)

# Parameter taken from https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecm.1486
xgboost_params <- xgboost_params %>% add_row(dataset = "literature", nrounds = 1000, eta = 0.001, max_depth = 5, subsample = 0.75, gamma = 0, alpha = 0, lambda = 1, colsample_bytree = 0.8, min_child_weight = 1)

xgboost_params <- xgboost_params %>% add_row(dataset = "tuned 1", nrounds = 2000, eta = 0.01, max_depth = 5, subsample = 1, gamma = 0, alpha = 0, lambda = 1, colsample_bytree = 1, min_child_weight = 1)

xgboost_params <- xgboost_params %>% add_row(dataset = "tuned 2", nrounds = 5000, eta = 0.3, max_depth = 5, subsample = 1, gamma = 2, alpha = 0, lambda = 1, colsample_bytree = 1, min_child_weight = 1)

xgboost_params <- xgboost_params %>% add_row(dataset = "tuned 3", nrounds = 3000, eta = 0.15, max_depth = 7, subsample = 0.75, gamma = 1.5, alpha = 0, lambda = 0.5, colsample_bytree = 1, min_child_weight = 1)

print(xgboost_params)
##      dataset nrounds   eta max_depth subsample gamma alpha lambda
## 1    default      NA 0.300         6      1.00   0.0     0    1.0
## 2 literature    1000 0.001         5      0.75   0.0     0    1.0
## 3    tuned 1    2000 0.010         5      1.00   0.0     0    1.0
## 4    tuned 2    5000 0.300         5      1.00   2.0     0    1.0
## 5    tuned 3    3000 0.150         7      0.75   1.5     0    0.5
##   colsample_bytree min_child_weight
## 1              1.0                1
## 2              0.8                1
## 3              1.0                1
## 4              1.0                1
## 5              1.0                1

Next is to perpare data used to predic species occurrence in pakistan. Therefore we will use the raw raster data and predict of those with ‘terra::predict’ which allows us to pass on a ‘Spatraster’ object. XGBoost can’t handle Spatraster, so ‘terra:predict’ allows as to define a custom prediction function, which converts data into a matrix.

# gen stack from rasters bioclim_pak and elevation_pak
elev_pak = resample(elevation_pak, bioclim_pak)
ext(elevation_pak) <- ext(bioclim_pak)
prediction_rstack = c(bioclim_pak, elev_pak)

# Remove values outside pakistan, because otherwise the model will make predictions outside the modeling area
prediction_rstack = terra::mask(prediction_rstack, border_pak)
# We need to make a custom predict function for terra::predict() since xgboost didn't take a data.frame as input. See https://stackoverflow.com/questions/71947124/predict-xgboost-model-onto-raster-stack-yields-error
prediction_custom <- function(model, data, ...) {
  stats::predict(model, newdata=as.matrix(data), ...)
}

predict <- function(model, prediction_data, plots=FALSE)
{

  #model = xgb.load(paste("out/", sp, "/" ,sp, ".model", sep = ""))
  #model = readRDS(paste("out/", sp, "/" ,sp, ".bin", sep = ""))
  prediction = terra::predict(object=prediction_data,
                       model=model,
                       fun=prediction_custom
  )
  
  terra::writeRaster(prediction, paste("out/", sp, ".tif", sep = ""), overwrite=TRUE)
  return(prediction)
}

4.4 ) Example model for species “Aglais_caschmirensis”

Three different paramter sets used for Aglais_caschmirensis

sp = "Aglais_caschmirensis"

print(xgboost_params[2,])


model <- train(modeling_data, sp, as.list(xgboost_params[2,]), save_model = FALSE)
## train_logloss: 0.420687878569378
prediction <- predict(model, prediction_rstack)
ggplot() + 
  geom_spatraster(data = prediction) +
  scale_fill_hypso_c(direction = -1,
                     limits=c(0,1),
                     name = "Prediction") +
  geom_sf(data = modeling_data %>% filter(species == sp),
          size = 1,
          shape = 1 ) +
  geom_sf(data = sf::st_as_sf(border_pak),
          fill = NA, show.legend=FALSE) +
   ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])), "\n params_index: 2"))
## SpatRaster resampled to ncells = 500703

rm(sp)
##      dataset nrounds   eta max_depth subsample gamma alpha lambda
## 2 literature    1000 0.001         5      0.75     0     0      1
##   colsample_bytree min_child_weight
## 2              0.8                1
## [21:17:33] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767: 
## Parameters: { "dataset", "nrounds" } are not used.
sp = "Aglais_caschmirensis"

print(xgboost_params[4,])

model <- train(modeling_data, sp, as.list(xgboost_params[4,]), save_model = FALSE)
## train_logloss: 0.0462355725952563
prediction <- predict(model, prediction_rstack)
ggplot() + 
  geom_spatraster(data = prediction) +
  scale_fill_hypso_c(direction = -1,
                     limits=c(0,1),
                     name = "Prediction") +
  geom_sf(data = modeling_data %>% filter(species == sp),
          size = 1,
          shape = 1 ) +
  geom_sf(data = sf::st_as_sf(border_pak),
          fill = NA, show.legend=FALSE) +
   ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])), "\n params_index: 4"))
## SpatRaster resampled to ncells = 500703

rm(sp)
##   dataset nrounds eta max_depth subsample gamma alpha lambda colsample_bytree
## 4 tuned 2    5000 0.3         5         1     2     0      1                1
##   min_child_weight
## 4                1
## [21:17:54] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767: 
## Parameters: { "dataset", "nrounds" } are not used.
sp = "Aglais_caschmirensis"

print(xgboost_params[5,])

model <- train(modeling_data, sp, as.list(xgboost_params[5,]), save_model = FALSE)
## train_logloss: 0.0216167349648276
prediction <- predict(model, prediction_rstack)
ggplot() + 
  geom_spatraster(data = prediction) +
  scale_fill_hypso_c(direction = -1,
                     limits=c(0,1),
                     name = "Prediction") +
  geom_sf(data = modeling_data %>% filter(species == sp),
          size = 1,
          shape = 1 ) +
  geom_sf(data = sf::st_as_sf(border_pak),
          fill = NA, show.legend=FALSE) +
   ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])), "\n params_index: 5"))
## SpatRaster resampled to ncells = 500703

rm(sp)
##   dataset nrounds  eta max_depth subsample gamma alpha lambda colsample_bytree
## 5 tuned 3    3000 0.15         7      0.75   1.5     0    0.5                1
##   min_child_weight
## 5                1
## [21:18:56] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767: 
## Parameters: { "dataset", "nrounds" } are not used.
  if(plot) {
    #xgb_model[["evaluation_log"]]
    
    importance <- xgb.importance(model = model)
    #xgb.plot.importance(importance_matrix = importance)
    
    #xgb.ggplot.deepness(xgb_model)
    
    #xgb.plot.multi.trees(mode = xgb_model, features_keep = 3)
    
    #library("DiagrammeRsvg", "rsvg")
    
    
    #gr <- xgb.plot.multi.trees(model=xgb_model, features_keep = 5, render=FALSE)
    #DiagrammeR::export_graph(gr, 'tree.pdf', width=600, height=1500)
    
    xgb.ggplot.shap.summary(data = as.matrix(modeling_data %>% select(-species_occurrence, -geometry)), model = model )
    ggsave(paste(sp, "_shap.png", sep=""), path=paste("out/",sp, sep=""))
    rm(importance)
  }

4.5 ) Species Richness Map

Finally we combine all predicted species occurrence into a Map that indicates how many species might occur in one pixel. First, we define a threshold above which the prediction should be considered. The prediction have been saved as tif in ‘out/.tif’ and we need to load them before modifying. After that, we can reclassify the raster with 0 and 1, based on the threshold. To get the number of species in one pixel we need to sum up all rasters into one final raster and plot it using ggplot.

time_start <- proc.time()
i <- 0

species = (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species

# xgboost needs the output dir to exist before saving model
system("mk out")

for(sp in species)
{
  time_start_species <- proc.time()
  i <- i+1
  message(paste("[", i, "/", length(species), "] ", sp, ": ", sep = ""), appendLF=F)
  
  # training and prediction of the species model
  model <- train(modeling_data, sp, as.list(xgboost_params[5,]))
  prediction <- predict(model, prediction_rstack)
  
  
  ggplot() + 
    geom_spatraster(data = prediction) +
    scale_fill_hypso_c(direction = -1,
                       limits=c(0,1),
                       name = "Prediction") +
    geom_sf(data = modeling_data %>% filter(species == sp),
            size = 1,
            shape = 1 ) +
    geom_sf(data = sf::st_as_sf(border_pak),
            fill = NA, show.legend=FALSE) +
    ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]]))))
  ggsave(paste(sp, ".png", sep=""), path="out")

  message(paste("Exceution took", proc.time() - time_start_species, "seconds"))
}
## [1/2] Acraea_issoria: train_logloss: 0.00718968736489119
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageExceution took 254.87 secondsExceution took 12.95 secondsExceution took 28.38 secondsExceution took NA secondsExceution took NA seconds
## [2/2] Aglais_caschmirensis: train_logloss: 0.0206861109287276
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageExceution took 362.95 secondsExceution took 17.81 secondsExceution took 38.18 secondsExceution took NA secondsExceution took NA seconds
message(paste("Total Exceution took", proc.time() - time_start, "seconds"))
## Total Exceution took 617.92 secondsTotal Exceution took 30.76 secondsTotal Exceution took 66.65 secondsTotal Exceution took NA secondsTotal Exceution took NA seconds
## [1] 127
## [21:19:36] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767: 
## Parameters: { "dataset", "nrounds" } are not used.
## 
## [21:20:04] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767: 
## Parameters: { "dataset", "nrounds" } are not used.
# Step 1  generate Raster Stack
# l_species will consist of all 421 species prediction rasters -> RAM usage will be insane ~ 25GB
l_species = list()

threshold = 0.5

# reclassify raster:
# value < threshold = 0
# value > threshold = 1
m <- c(0, threshold, 0,
       threshold, 1, 1)
m <- matrix(m, ncol=3, byrow=TRUE)

for(sp in (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species)
{
  # get species raster from file system
  r <- rast(paste("./out/",sp,".tif", sep = "" ))

  l_species[sp] <- terra::classify(r, m, include.lowest = TRUE)

  rm(r)
} 
## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated
stack = terra::rast(l_species)
stack = sum(stack)
stack = terra::mask(stack, border_pak)

ggplot() + 
  geom_spatraster(data = stack) +
  scale_fill_hypso_c(palette="spain", name="N° of species" ) +
  #scale_fill_hypso_b(name="N° of species") +
  geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
  ggtitle("Species richness of butterflies in Pakistan")
## SpatRaster resampled to ncells = 500703

ggsave("SpeciesRichnessMap.png", path="out") 
## Saving 7 x 5 in image

For how many species is no prediction made? -> count(max(raster) = 0)

5 ) Conclusion

6 ) Sources

7 ) Troubleshooting

Installing XGBoost with GPU accerlation

Pre-built binary packages are offered by XGBoost after someone makes a request on GitHub (https://github.com/dmlc/xgboost/issues/6654). Since the packages are precompiled, it should not be necessary to compile them from source. The installation still fails with error rknitr::inline_expr(“Warning in system(”sh ./configure.win”) ‘sh’ not found”)`. So there are two ways to fix this problem: - Install R-Tools and build from source - Copy src/xgboost.dll from archive (https://github.com/dmlc/xgboost/releases) into your r library manually e.g C:%USERNAME%-library\4.2

Sessioninfo

sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xgboost_1.7.5.1   fastDummies_1.6.3 tidyterra_0.4.0   sf_1.0-12        
## [5] geodata_0.5-8     terra_1.7-29      ggplot2_3.4.2     dplyr_1.1.2      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0   xfun_0.39          bslib_0.4.2        purrr_1.0.1       
##  [5] lattice_0.20-45    colorspace_2.1-0   vctrs_0.6.2        generics_0.1.3    
##  [9] htmltools_0.5.5    s2_1.1.3           yaml_2.3.7         utf8_1.2.3        
## [13] rlang_1.1.0        e1071_1.7-13       jquerylib_0.1.4    pillar_1.9.0      
## [17] glue_1.6.2         withr_2.5.0        DBI_1.1.3          wk_0.7.3          
## [21] lifecycle_1.0.3    stringr_1.5.0      munsell_0.5.0      gtable_0.3.3      
## [25] ragg_1.2.5         codetools_0.2-19   evaluate_0.21      labeling_0.4.2    
## [29] knitr_1.42         fastmap_1.1.1      class_7.3-21       fansi_1.0.4       
## [33] highr_0.10         Rcpp_1.0.10        KernSmooth_2.23-20 scales_1.2.1      
## [37] classInt_0.4-9     cachem_1.0.7       jsonlite_1.8.5     systemfonts_1.0.4 
## [41] farver_2.1.1       textshaping_0.3.6  digest_0.6.31      stringi_1.7.12    
## [45] grid_4.2.3         cli_3.6.1          tools_4.2.3        magrittr_2.0.3    
## [49] sass_0.4.6         proxy_0.4-27       tibble_3.2.1       tidyr_1.3.0       
## [53] pkgconfig_2.0.3    Matrix_1.5-3       data.table_1.14.8  rmarkdown_2.21    
## [57] rstudioapi_0.14    R6_2.5.1           units_0.8-1        compiler_4.2.3